Morphological features help POS tagging of unknown words across language varieties
نویسندگان
چکیده
Part-of-speech tagging, like any supervised statistical NLP task, is more difficult when test sets are very different from training sets, for example when tagging across genres or language varieties. We examined the problem of POS tagging of different varieties of Mandarin Chinese (PRC-Mainland, PRCHong Kong, and Taiwan). An analytic study first showed that unknown words were a major source of difficulty in cross-variety tagging. Unknown words in English tend to be proper nouns. By contrast, we found that Mandarin unknown words were mostly common nouns and verbs. We showed these results are caused by the high frequency of morphological compounding in Mandarin; in this sense Mandarin is more like German than English. Based on this analysis, we propose a variety of new morphological unknown-word features for POS tagging, extending earlier work by others on unknown-word tagging in English and German. Our features were implemented in a maximum entropy Markov model. Our system achieves state-of-the-art performance in Mandarin tagging, including improving unknown-word tagging performance on unseen varieties in Chinese Treebank 5.0 from 61% to 80% correct.
منابع مشابه
سیستم برچسب گذاری اجزای واژگانی کلام در زبان فارسی
Abstract: Part-Of-Speech (POS) tagging is essential work for many models and methods in other areas in natural language processing such as machine translation, spell checker, text-to-speech, automatic speech recognition, etc. So far, high accurate POS taggers have been created in many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...
متن کاملA Hybrid Morphology-Based POS Tagger for Persian
In many applications of natural language processing (NLP) grammatically tagged corpora are needed. Thus Part of Speech (POS) Tagging is of high importance in the domain of NLP. Many taggers are designed with different approaches to reach high performance and accuracy. These taggers usually deal with inter-word relations and they make use of lexicons. In this paper we present a new tagging algor...
متن کاملMorphological Ending – based Strategies of Unknown Word Estimation for Statistical POS Urdu Tagger
Natural language processing has widely used Statistical based language models to solve disambiguation problems. Over the past decades different techniques regarding POS tagging have been proposed for English, European and East Asian languages. In this paper our focus is POS tagging for Urdu due to the infancy stage of Urdu language based tagging system. We have combined two approaches (Statisti...
متن کاملHybrid Models for Chinese Unknown Word Resolution Dissertation
Word segmentation, part-of-speech (POS) tagging, and sense tagging are important steps in various Chinese natural language processing (CNLP) systems. Unknown words, i.e., words that are not in the dictionary or training data used in a CNLP system, constitute a major challenge for each of these steps. This dissertation is concerned with developing hybrid models that effectively combine statistic...
متن کاملTigrinya Part-of-Speech Tagging with Morphological Patterns and the New Nagaoka Tigrinya CorpusTigrinya Part-of-Speech Tagging with Morphological Patterns and the New Nagaoka Tigrinya Corpus
This paper presents the first part-of-speech (POS) tagging research for Tigrinya (Semitic language) from the newly constructed Nagaoka Tigrinya Corpus. The raw text was extracted from a newspaper published in Eritrea in the Tigrinya language. This initial corpus was cleaned and formatted in plaintext and the Text Encoding Initiative (TEI) XML format. A tagset of 73 tags was designed, and the co...
متن کامل